In [1]:
pip install scikit-learn==1.2.2
Requirement already satisfied: scikit-learn==1.2.2 in /opt/homebrew/anaconda3/lib/python3.11/site-packages (1.2.2)
Requirement already satisfied: numpy>=1.17.3 in /opt/homebrew/anaconda3/lib/python3.11/site-packages (from scikit-learn==1.2.2) (1.24.3)
Requirement already satisfied: scipy>=1.3.2 in /opt/homebrew/anaconda3/lib/python3.11/site-packages (from scikit-learn==1.2.2) (1.11.1)
Requirement already satisfied: joblib>=1.1.1 in /opt/homebrew/anaconda3/lib/python3.11/site-packages (from scikit-learn==1.2.2) (1.2.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in /opt/homebrew/anaconda3/lib/python3.11/site-packages (from scikit-learn==1.2.2) (2.2.0)
Note: you may need to restart the kernel to use updated packages.
In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
In [3]:
conda install nbformat
Collecting package metadata (current_repodata.json): done
Solving environment: \ 
The environment is inconsistent, please check the package plan carefully
done





## Package Plan ##

  environment location: /opt/homebrew/anaconda3

  added / updated specs:
    - nbformat


The following NEW packages will be INSTALLED:

  pip                pkgs/main/osx-arm64::pip-23.3.1-py311hca03da5_0 
  setuptools         pkgs/main/osx-arm64::setuptools-68.2.2-py311hca03da5_0 

The following packages will be UPDATED:

  ca-certificates                     2023.12.12-hca03da5_0 --> 2024.3.11-hca03da5_0 



Downloading and Extracting Packages

Preparing transaction: done
Verifying transaction: failed

RemoveError: 'setuptools' is a dependency of conda and cannot be removed from
conda's operating environment.


Note: you may need to restart the kernel to use updated packages.
In [4]:
data = pd.read_csv('heart_2022_no_nans.csv')
#data.drop(columns=['State'], inplace=True)
data
#data.head(10)
Out[4]:
State Sex GeneralHealth PhysicalHealthDays MentalHealthDays LastCheckupTime PhysicalActivities SleepHours RemovedTeeth HadHeartAttack ... HeightInMeters WeightInKilograms BMI AlcoholDrinkers HIVTesting FluVaxLast12 PneumoVaxEver TetanusLast10Tdap HighRiskLastYear CovidPos
0 Alabama Female Very good 4.0 0.0 Within past year (anytime less than 12 months ... Yes 9.0 None of them No ... 1.60 71.67 27.99 No No Yes Yes Yes, received Tdap No No
1 Alabama Male Very good 0.0 0.0 Within past year (anytime less than 12 months ... Yes 6.0 None of them No ... 1.78 95.25 30.13 No No Yes Yes Yes, received tetanus shot but not sure what type No No
2 Alabama Male Very good 0.0 0.0 Within past year (anytime less than 12 months ... No 8.0 6 or more, but not all No ... 1.85 108.86 31.66 Yes No No Yes No, did not receive any tetanus shot in the pa... No Yes
3 Alabama Female Fair 5.0 0.0 Within past year (anytime less than 12 months ... Yes 9.0 None of them No ... 1.70 90.72 31.32 No No Yes Yes No, did not receive any tetanus shot in the pa... No Yes
4 Alabama Female Good 3.0 15.0 Within past year (anytime less than 12 months ... Yes 5.0 1 to 5 No ... 1.55 79.38 33.07 No No Yes Yes No, did not receive any tetanus shot in the pa... No No
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
246017 Virgin Islands Male Very good 0.0 0.0 Within past 2 years (1 year but less than 2 ye... Yes 6.0 None of them No ... 1.78 102.06 32.28 Yes No No No Yes, received tetanus shot but not sure what type No No
246018 Virgin Islands Female Fair 0.0 7.0 Within past year (anytime less than 12 months ... Yes 7.0 None of them No ... 1.93 90.72 24.34 No No No No No, did not receive any tetanus shot in the pa... No Yes
246019 Virgin Islands Male Good 0.0 15.0 Within past year (anytime less than 12 months ... Yes 7.0 1 to 5 No ... 1.68 83.91 29.86 Yes Yes Yes Yes Yes, received tetanus shot but not sure what type No Yes
246020 Virgin Islands Female Excellent 2.0 2.0 Within past year (anytime less than 12 months ... Yes 7.0 None of them No ... 1.70 83.01 28.66 No Yes Yes No Yes, received tetanus shot but not sure what type No No
246021 Virgin Islands Male Very good 0.0 0.0 Within past year (anytime less than 12 months ... No 5.0 None of them Yes ... 1.83 108.86 32.55 No Yes Yes Yes No, did not receive any tetanus shot in the pa... No Yes

246022 rows × 40 columns

In [5]:
#DELETE

print(data['AgeCategory'])
0            Age 65 to 69
1            Age 70 to 74
2            Age 75 to 79
3         Age 80 or older
4         Age 80 or older
               ...       
246017       Age 60 to 64
246018       Age 25 to 29
246019       Age 65 to 69
246020       Age 50 to 54
246021       Age 70 to 74
Name: AgeCategory, Length: 246022, dtype: object
In [6]:
# Encode AgeCategory.
# Rather than assigning an arbitrary identifier to each age group,
# map each bracket to its approximate midpoint age so the feature
# becomes numeric and preserves the ordering of the groups.


encode_AgeCategory = {
    'Age 18 to 24': 21,
    'Age 25 to 29': 27,
    'Age 30 to 34': 32,
    'Age 35 to 39': 37,
    'Age 40 to 44': 42,
    'Age 45 to 49': 47,
    'Age 50 to 54': 52,
    'Age 55 to 59': 57,
    'Age 60 to 64': 62,
    'Age 65 to 69': 67,
    'Age 70 to 74': 72,
    'Age 75 to 79': 77,
    'Age 80 or older': 80
}

data['Age_Category_Avg'] = data['AgeCategory'].map(encode_AgeCategory)
#data.to_csv('heart_2022_no_nans.csv', index=False)
    
#data_2.to_csv('modified_data.csv', index=False)
In [7]:
print(data['AgeCategory'])
0            Age 65 to 69
1            Age 70 to 74
2            Age 75 to 79
3         Age 80 or older
4         Age 80 or older
               ...       
246017       Age 60 to 64
246018       Age 25 to 29
246019       Age 65 to 69
246020       Age 50 to 54
246021       Age 70 to 74
Name: AgeCategory, Length: 246022, dtype: object
In [8]:
print(data['Age_Category_Avg'])
0         67
1         72
2         77
3         80
4         80
          ..
246017    62
246018    27
246019    67
246020    52
246021    72
Name: Age_Category_Avg, Length: 246022, dtype: int64
In [9]:
print(data['AgeCategory'].dtype)
object
In [10]:
data.describe()
Out[10]:
PhysicalHealthDays MentalHealthDays SleepHours HeightInMeters WeightInKilograms BMI Age_Category_Avg
count 246022.000000 246022.000000 246022.000000 246022.000000 246022.000000 246022.000000 246022.000000
mean 4.119026 4.167140 7.021331 1.705150 83.615179 28.668136 55.392262
std 8.405844 8.102687 1.440681 0.106654 21.323156 6.513973 17.218703
min 0.000000 0.000000 1.000000 0.910000 28.120000 12.020000 21.000000
25% 0.000000 0.000000 6.000000 1.630000 68.040000 24.270000 42.000000
50% 0.000000 0.000000 7.000000 1.700000 81.650000 27.460000 57.000000
75% 3.000000 4.000000 8.000000 1.780000 95.250000 31.890000 72.000000
max 30.000000 30.000000 24.000000 2.410000 292.570000 97.650000 80.000000
In [11]:
# checking if we still have any null data (even though the author of the file says it is already cleaned)

data.isnull().sum()
Out[11]:
State                        0
Sex                          0
GeneralHealth                0
PhysicalHealthDays           0
MentalHealthDays             0
LastCheckupTime              0
PhysicalActivities           0
SleepHours                   0
RemovedTeeth                 0
HadHeartAttack               0
HadAngina                    0
HadStroke                    0
HadAsthma                    0
HadSkinCancer                0
HadCOPD                      0
HadDepressiveDisorder        0
HadKidneyDisease             0
HadArthritis                 0
HadDiabetes                  0
DeafOrHardOfHearing          0
BlindOrVisionDifficulty      0
DifficultyConcentrating      0
DifficultyWalking            0
DifficultyDressingBathing    0
DifficultyErrands            0
SmokerStatus                 0
ECigaretteUsage              0
ChestScan                    0
RaceEthnicityCategory        0
AgeCategory                  0
HeightInMeters               0
WeightInKilograms            0
BMI                          0
AlcoholDrinkers              0
HIVTesting                   0
FluVaxLast12                 0
PneumoVaxEver                0
TetanusLast10Tdap            0
HighRiskLastYear             0
CovidPos                     0
Age_Category_Avg             0
dtype: int64
In [12]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 246022 entries, 0 to 246021
Data columns (total 41 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   State                      246022 non-null  object 
 1   Sex                        246022 non-null  object 
 2   GeneralHealth              246022 non-null  object 
 3   PhysicalHealthDays         246022 non-null  float64
 4   MentalHealthDays           246022 non-null  float64
 5   LastCheckupTime            246022 non-null  object 
 6   PhysicalActivities         246022 non-null  object 
 7   SleepHours                 246022 non-null  float64
 8   RemovedTeeth               246022 non-null  object 
 9   HadHeartAttack             246022 non-null  object 
 10  HadAngina                  246022 non-null  object 
 11  HadStroke                  246022 non-null  object 
 12  HadAsthma                  246022 non-null  object 
 13  HadSkinCancer              246022 non-null  object 
 14  HadCOPD                    246022 non-null  object 
 15  HadDepressiveDisorder      246022 non-null  object 
 16  HadKidneyDisease           246022 non-null  object 
 17  HadArthritis               246022 non-null  object 
 18  HadDiabetes                246022 non-null  object 
 19  DeafOrHardOfHearing        246022 non-null  object 
 20  BlindOrVisionDifficulty    246022 non-null  object 
 21  DifficultyConcentrating    246022 non-null  object 
 22  DifficultyWalking          246022 non-null  object 
 23  DifficultyDressingBathing  246022 non-null  object 
 24  DifficultyErrands          246022 non-null  object 
 25  SmokerStatus               246022 non-null  object 
 26  ECigaretteUsage            246022 non-null  object 
 27  ChestScan                  246022 non-null  object 
 28  RaceEthnicityCategory      246022 non-null  object 
 29  AgeCategory                246022 non-null  object 
 30  HeightInMeters             246022 non-null  float64
 31  WeightInKilograms          246022 non-null  float64
 32  BMI                        246022 non-null  float64
 33  AlcoholDrinkers            246022 non-null  object 
 34  HIVTesting                 246022 non-null  object 
 35  FluVaxLast12               246022 non-null  object 
 36  PneumoVaxEver              246022 non-null  object 
 37  TetanusLast10Tdap          246022 non-null  object 
 38  HighRiskLastYear           246022 non-null  object 
 39  CovidPos                   246022 non-null  object 
 40  Age_Category_Avg           246022 non-null  int64  
dtypes: float64(6), int64(1), object(34)
memory usage: 77.0+ MB
In [13]:
data.shape
Out[13]:
(246022, 41)
In [14]:
data.value_counts('HadHeartAttack')
Out[14]:
HadHeartAttack
No     232587
Yes     13435
Name: count, dtype: int64

Exploratory Data Analysis¶

A few observations at first glance:

The dataset is imbalanced: the target classes are unevenly distributed. (Examples and plots are provided below.)

Features such as "HadStroke", "AgeCategory", "DifficultyWalking", and possibly "HadDiabetes" appear to be stronger predictors of the target variable ("HadHeartAttack").

"Sex" and "Race" exhibit lower correlation values, indicating a weaker direct relationship with heart attacks in this dataset, so these features are candidates for removal later.
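The imbalance claim can be made concrete with `value_counts`; a minimal sketch on a hypothetical miniature series (the values below are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical miniature target column with the same kind of No/Yes skew
# as 'HadHeartAttack' (values are illustrative, not from the dataset)
s = pd.Series(['No'] * 95 + ['Yes'] * 5, name='HadHeartAttack')

# Relative class frequencies make the imbalance explicit
freq = s.value_counts(normalize=True)
print(freq['No'], freq['Yes'])                              # 0.95 0.05
print(f"imbalance ratio {freq['No'] / freq['Yes']:.1f}:1")  # 19.0:1
```

On the real column (`data.value_counts('HadHeartAttack')` above) the same ratio is roughly 232587/13435 ≈ 17:1, which is why resampling is applied later.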

In [15]:
# Encode 'Sex' column

data_check = data.copy()

data_check['Sex'] = data_check['Sex'].map({'Female': 0, 'Male': 1})

# Encode 'RaceEthnicityCategory' column
data_check['RaceEthnicityCategory'] = data_check['RaceEthnicityCategory'].map({
    'White only, Non-Hispanic': 0,
    'Black only, Non-Hispanic': 1,
    'Other race only, Non-Hispanic': 2,
    'Multiracial, Non-Hispanic': 3,
    'Hispanic': 4
})

# Encode 'HadHeartAttack' column
data_check['HadHeartAttack'] = data_check['HadHeartAttack'].map({'Yes': 1, 'No': 0})

# Create a correlation matrix
corr_matrix = data_check[['Sex', 'RaceEthnicityCategory', 'HadHeartAttack']].corr()

# Plot heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f", annot_kws={"size": 12})
plt.title('Correlation Heatmap')
plt.show()
In [16]:
sns.countplot(x="HadHeartAttack", data=data)
plt.title("Distribution of Heart Attack")
plt.xlabel("Had Heart Attack")
plt.ylabel("Count")
plt.xticks(ticks=[0, 1], labels=["No", "Yes"])  # Rename the x-axis tick labels
plt.show()
In [17]:
#MAYBE

plt.figure(figsize=(12,12))
sns.boxplot(data=data)
plt.title('Boxplots of Numerical Features')
plt.show()
In [18]:
cat_data = data.select_dtypes(include='object')
num_data = data.select_dtypes(exclude='object')

print("categorical features: ", cat_data.columns.to_list())
print("numerical features: ", num_data.columns.to_list())


data.head()
categorical features:  ['State', 'Sex', 'GeneralHealth', 'LastCheckupTime', 'PhysicalActivities', 'RemovedTeeth', 'HadHeartAttack', 'HadAngina', 'HadStroke', 'HadAsthma', 'HadSkinCancer', 'HadCOPD', 'HadDepressiveDisorder', 'HadKidneyDisease', 'HadArthritis', 'HadDiabetes', 'DeafOrHardOfHearing', 'BlindOrVisionDifficulty', 'DifficultyConcentrating', 'DifficultyWalking', 'DifficultyDressingBathing', 'DifficultyErrands', 'SmokerStatus', 'ECigaretteUsage', 'ChestScan', 'RaceEthnicityCategory', 'AgeCategory', 'AlcoholDrinkers', 'HIVTesting', 'FluVaxLast12', 'PneumoVaxEver', 'TetanusLast10Tdap', 'HighRiskLastYear', 'CovidPos']
numerical features:  ['PhysicalHealthDays', 'MentalHealthDays', 'SleepHours', 'HeightInMeters', 'WeightInKilograms', 'BMI', 'Age_Category_Avg']
Out[18]:
State Sex GeneralHealth PhysicalHealthDays MentalHealthDays LastCheckupTime PhysicalActivities SleepHours RemovedTeeth HadHeartAttack ... WeightInKilograms BMI AlcoholDrinkers HIVTesting FluVaxLast12 PneumoVaxEver TetanusLast10Tdap HighRiskLastYear CovidPos Age_Category_Avg
0 Alabama Female Very good 4.0 0.0 Within past year (anytime less than 12 months ... Yes 9.0 None of them No ... 71.67 27.99 No No Yes Yes Yes, received Tdap No No 67
1 Alabama Male Very good 0.0 0.0 Within past year (anytime less than 12 months ... Yes 6.0 None of them No ... 95.25 30.13 No No Yes Yes Yes, received tetanus shot but not sure what type No No 72
2 Alabama Male Very good 0.0 0.0 Within past year (anytime less than 12 months ... No 8.0 6 or more, but not all No ... 108.86 31.66 Yes No No Yes No, did not receive any tetanus shot in the pa... No Yes 77
3 Alabama Female Fair 5.0 0.0 Within past year (anytime less than 12 months ... Yes 9.0 None of them No ... 90.72 31.32 No No Yes Yes No, did not receive any tetanus shot in the pa... No Yes 80
4 Alabama Female Good 3.0 15.0 Within past year (anytime less than 12 months ... Yes 5.0 1 to 5 No ... 79.38 33.07 No No Yes Yes No, did not receive any tetanus shot in the pa... No No 80

5 rows × 41 columns

In [19]:
plt.rcParams['figure.figsize'] = (20, 8)
for c in cat_data:
    sns.countplot(x=c, hue='HadHeartAttack', data=data)
    plt.title(f'Heart Attack Count Grouped by {c}')
    plt.xlabel(c)
    plt.ylabel('Count')
    plt.xticks(rotation=90)
    plt.show()
In [20]:
sns.heatmap(num_data.corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Heatmap")
plt.show()
In [21]:
data_check = data.copy()

# Encode 'HadHeartAttack' column
data_check['HadHeartAttack'] = data_check['HadHeartAttack'].map({'Yes': 1, 'No': 0})

# Select numerical columns (copy to avoid chained-assignment warnings)
num_data = data.select_dtypes(exclude='object').copy()
num_data['HadHeartAttack'] = data_check['HadHeartAttack']  # include the encoded target

# Save as data_num_cat_check
data_num_cat_check = num_data

# Build Correlation Heatmap
corr_matrix = data_num_cat_check.corr()

# Plot heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f", annot_kws={"size": 10})
plt.title('Correlation Heatmap')
plt.show()
In [22]:
sns.pairplot(data,height=2)
plt.show()
/opt/homebrew/anaconda3/lib/python3.11/site-packages/seaborn/axisgrid.py:118: UserWarning: The figure layout has changed to tight
  self._figure.tight_layout(*args, **kwargs)
In [23]:
disease=['HadAngina', 'HadStroke', 'HadAsthma', 'HadSkinCancer', 'HadCOPD', 
          'HadDepressiveDisorder', 'HadKidneyDisease', 'HadArthritis', 'HadDiabetes']
for d in disease:
    df_filtered = data[data[d] == 'Yes']
    
    if not df_filtered.empty:
        plt.figure(figsize=(20,10))
        sns.countplot(x='HadHeartAttack', data=df_filtered)
        plt.title(f'Heart Attack Count among Patients with {d}')
        plt.xlabel('HadHeartAttack')
        plt.ylabel('Count')
        plt.show()
In [24]:
for feature in num_data:
    plt.figure(figsize=(10, 6))
    sns.histplot(data=data, x=feature, hue='HadHeartAttack', kde=True, element='step', stat='count')
    plt.title(f'Histogram of {feature} with Heart Attack Overlay')
    plt.xlabel(feature)
    plt.ylabel('Count')
    plt.legend()
    plt.show()
No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
In [25]:
sorted_age_categories = sorted(data['AgeCategory'].unique())

sns.scatterplot(data=data, x='BMI', y='AgeCategory', hue='HadHeartAttack')

# Set the title, labels, and legend
plt.title('Scatter Plot of BMI vs Age by Heart Attack Status')
plt.xlabel('BMI')
plt.ylabel('Age')
plt.yticks(ticks=range(len(sorted_age_categories)), labels=sorted_age_categories)  # Set y-axis tick labels
plt.legend(title='Heart Attack')

# Show the plot
plt.show()
In [26]:
#num_data = num_data.drop(columns=['HadHeartAttack'])
num_data.hist(figsize=(16, 20), bins=40, xlabelsize=6, ylabelsize=6);
In [27]:
age_heart_disease = data.groupby('AgeCategory')['HadHeartAttack'].value_counts().unstack().fillna(0)

age_heart_disease.plot(kind='bar', stacked=True)
plt.title('Number of People with Heart Attacks by Age Category')
plt.xlabel('Age Category')
plt.ylabel('Count')
plt.xticks(rotation=0)
plt.legend(title='Heart Attack', labels=['No', 'Yes'])
plt.tight_layout() 
plt.show()
In [28]:
gender_heart_attack = data.groupby('Sex')['HadHeartAttack'].value_counts().unstack().fillna(0)

gender_heart_attack.plot(kind='bar', stacked=True)
plt.title('Number of People with Heart Attack by Sex Category')
plt.xlabel('Sex')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.legend(title='Heart Attack', labels=['No', 'Yes'])
plt.tight_layout() 
plt.show()

Preprocessing¶

I will encode categorical features using LabelEncoder.

I will also apply RobustScaler to reduce the influence of outliers. Note: no outliers were removed, since they occur in significant numbers and carry information relevant to this analysis.

I will use SMOTE oversampling to correct the class imbalance.
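A minimal end-to-end sketch of these three steps on a toy frame (column names and values are illustrative; plain random oversampling stands in for SMOTE here so the sketch needs only scikit-learn and pandas):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, RobustScaler

# Toy frame standing in for the survey data (column names are illustrative)
df = pd.DataFrame({
    'Sex': ['Female', 'Male', 'Male', 'Female', 'Male', 'Female'],
    'BMI': [22.0, 31.0, 28.0, 24.0, 40.0, 26.0],
    'HadHeartAttack': ['No', 'No', 'No', 'No', 'Yes', 'No'],
})

# 1) Label-encode each categorical column (alphabetical: Female->0, No->0, ...)
for c in ['Sex', 'HadHeartAttack']:
    df[c] = LabelEncoder().fit_transform(df[c])

# 2) RobustScaler centres on the median and scales by the IQR, so the
#    outlying BMI of 40 distorts the scale far less than mean/std scaling would
df[['BMI']] = RobustScaler().fit_transform(df[['BMI']])

# 3) Plain random oversampling of the minority class as a stand-in for SMOTE;
#    the notebook itself uses imblearn's SMOTE, which interpolates synthetic
#    samples between minority neighbours instead of duplicating rows
minority = df[df['HadHeartAttack'] == 1]
n_extra = (df['HadHeartAttack'] == 0).sum() - len(minority)
balanced = pd.concat([df, minority.sample(n_extra, replace=True, random_state=42)])
print(balanced['HadHeartAttack'].value_counts())  # classes are now 5 and 5
```

SMOTE produces new, slightly different minority rows rather than copies, which tends to generalise better than the duplication shown in step 3.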

In [29]:
from sklearn import preprocessing 
from sklearn.preprocessing import RobustScaler
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
In [30]:
import sklearn
print(sklearn.__version__)
1.2.2
In [33]:
label_encoder = preprocessing.LabelEncoder()
for c in cat_data:
    data[c] = label_encoder.fit_transform(data[c])

data.head()
Out[33]:
State Sex GeneralHealth PhysicalHealthDays MentalHealthDays LastCheckupTime PhysicalActivities SleepHours RemovedTeeth HadHeartAttack ... WeightInKilograms BMI AlcoholDrinkers HIVTesting FluVaxLast12 PneumoVaxEver TetanusLast10Tdap HighRiskLastYear CovidPos Age_Category_Avg
0 0 0 4 4.0 0.0 3 1 9.0 3 0 ... 71.67 27.99 0 0 1 1 1 0 0 67
1 0 1 4 0.0 0.0 3 1 6.0 3 0 ... 95.25 30.13 0 0 1 1 2 0 0 72
2 0 1 4 0.0 0.0 3 0 8.0 1 0 ... 108.86 31.66 1 0 0 1 0 0 2 77
3 0 0 1 5.0 0.0 3 1 9.0 3 0 ... 90.72 31.32 0 0 1 1 0 0 2 80
4 0 0 2 3.0 15.0 3 1 5.0 0 0 ... 79.38 33.07 0 0 1 1 0 0 0 80

5 rows × 41 columns

In [34]:
scaler=RobustScaler()
scaled_data=scaler.fit_transform(data)
sns.boxplot(data=scaled_data)
plt.show()
In [35]:
race_groups = data.groupby('RaceEthnicityCategory')['HadHeartAttack'].value_counts(normalize=True).unstack(fill_value=0)
race_groups['Ratio'] = race_groups[1] / race_groups[0]

print(race_groups[['Ratio']])
HadHeartAttack            Ratio
RaceEthnicityCategory          
0                      0.048208
1                      0.039565
2                      0.064873
3                      0.050887
4                      0.061260

Since the heart-attack ratio varies only slightly across race/ethnicity categories (roughly 0.04 to 0.06), race does not appear to have a significant effect on heart attacks in this dataset.

In [36]:
data.drop(columns=['RaceEthnicityCategory'], inplace=True)
data.head()
Out[36]:
State Sex GeneralHealth PhysicalHealthDays MentalHealthDays LastCheckupTime PhysicalActivities SleepHours RemovedTeeth HadHeartAttack ... WeightInKilograms BMI AlcoholDrinkers HIVTesting FluVaxLast12 PneumoVaxEver TetanusLast10Tdap HighRiskLastYear CovidPos Age_Category_Avg
0 0 0 4 4.0 0.0 3 1 9.0 3 0 ... 71.67 27.99 0 0 1 1 1 0 0 67
1 0 1 4 0.0 0.0 3 1 6.0 3 0 ... 95.25 30.13 0 0 1 1 2 0 0 72
2 0 1 4 0.0 0.0 3 0 8.0 1 0 ... 108.86 31.66 1 0 0 1 0 0 2 77
3 0 0 1 5.0 0.0 3 1 9.0 3 0 ... 90.72 31.32 0 0 1 1 0 0 2 80
4 0 0 2 3.0 15.0 3 1 5.0 0 0 ... 79.38 33.07 0 0 1 1 0 0 0 80

5 rows × 40 columns

In [37]:
data.drop(columns=['State'], inplace=True)
In [38]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 246022 entries, 0 to 246021
Data columns (total 39 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   Sex                        246022 non-null  int64  
 1   GeneralHealth              246022 non-null  int64  
 2   PhysicalHealthDays         246022 non-null  float64
 3   MentalHealthDays           246022 non-null  float64
 4   LastCheckupTime            246022 non-null  int64  
 5   PhysicalActivities         246022 non-null  int64  
 6   SleepHours                 246022 non-null  float64
 7   RemovedTeeth               246022 non-null  int64  
 8   HadHeartAttack             246022 non-null  int64  
 9   HadAngina                  246022 non-null  int64  
 10  HadStroke                  246022 non-null  int64  
 11  HadAsthma                  246022 non-null  int64  
 12  HadSkinCancer              246022 non-null  int64  
 13  HadCOPD                    246022 non-null  int64  
 14  HadDepressiveDisorder      246022 non-null  int64  
 15  HadKidneyDisease           246022 non-null  int64  
 16  HadArthritis               246022 non-null  int64  
 17  HadDiabetes                246022 non-null  int64  
 18  DeafOrHardOfHearing        246022 non-null  int64  
 19  BlindOrVisionDifficulty    246022 non-null  int64  
 20  DifficultyConcentrating    246022 non-null  int64  
 21  DifficultyWalking          246022 non-null  int64  
 22  DifficultyDressingBathing  246022 non-null  int64  
 23  DifficultyErrands          246022 non-null  int64  
 24  SmokerStatus               246022 non-null  int64  
 25  ECigaretteUsage            246022 non-null  int64  
 26  ChestScan                  246022 non-null  int64  
 27  AgeCategory                246022 non-null  int64  
 28  HeightInMeters             246022 non-null  float64
 29  WeightInKilograms          246022 non-null  float64
 30  BMI                        246022 non-null  float64
 31  AlcoholDrinkers            246022 non-null  int64  
 32  HIVTesting                 246022 non-null  int64  
 33  FluVaxLast12               246022 non-null  int64  
 34  PneumoVaxEver              246022 non-null  int64  
 35  TetanusLast10Tdap          246022 non-null  int64  
 36  HighRiskLastYear           246022 non-null  int64  
 37  CovidPos                   246022 non-null  int64  
 38  Age_Category_Avg           246022 non-null  int64  
dtypes: float64(6), int64(33)
memory usage: 73.2 MB

Resampling and splitting¶

In [39]:
y = data['HadHeartAttack'] 
X = data.drop('HadHeartAttack', axis=1) 


smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)
In [40]:
x_train,x_test,y_train,y_test = train_test_split(X_resampled,y_resampled,test_size=0.30,random_state=42)
In [41]:
x_train_non,x_test_non,y_train_non,y_test_non = train_test_split(X,y,test_size=0.30,random_state=42)

Model Selection¶

  1. Decision Trees

  2. Random Forest

  3. K-Nearest Neighbors (KNN)

In [42]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_curve, roc_auc_score, classification_report
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.metrics import precision_score, recall_score, f1_score
In [43]:
decision_tree = DecisionTreeClassifier(random_state=42)
In [44]:
knn = KNeighborsClassifier()
In [45]:
rf=RandomForestClassifier(n_estimators=77, max_depth=None, random_state=42, n_jobs=-1)

Model Training and evaluation¶

In [46]:
def train_without_kfolds(classifier, x_train, y_train, x_test, y_test):
    classifier.fit(x_train, y_train)

    prediction = classifier.predict(x_test)
    predicted_proba = classifier.predict_proba(x_test)[:, 1]

    return prediction, predicted_proba
In [47]:
def evaluating_model(model, y_test, y_pred, predicted_proba, step_factor=0.1, threshold=0):
    roc_score = 0
    threshold_value = threshold
    thrsh_score = threshold  # best threshold found so far
    original_auc = roc_auc_score(y_test, y_pred)
    while threshold_value <= 1:
        predicted = (predicted_proba >= threshold_value).astype('int')
        current_roc_score = roc_auc_score(y_test, predicted)
        print('Threshold', threshold_value, '--', current_roc_score)

        if roc_score < current_roc_score:
            roc_score = current_roc_score
            thrsh_score = threshold_value

        threshold_value += step_factor

    print('---Optimum Threshold ---', thrsh_score, '--ROC--', roc_score)
    false_positive_rate, true_positive_rate, _ = roc_curve(y_test, predicted_proba)
    plt.subplots(1, figsize=(10, 10))
    plt.title(f'Receiver Operating Characteristic - {model}')
    plt.plot(false_positive_rate, true_positive_rate)
    plt.plot([0, 1], ls="--")
    plt.plot([0, 0], [1, 0], c=".7"), plt.plot([1, 1], c=".7")
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    plt.show()

    print(f"Test ROC AUC Score: {original_auc:.2%}")

    print(classification_report(y_test, y_pred))
    print('----Different scores----')
    print(f'Accuracy_score: {accuracy_score(y_test, y_pred)}')
    print(f'Precision_score: {precision_score(y_test, y_pred)}')
    print(f'Recall_score: {recall_score(y_test, y_pred)}')
    print(f'F1-score: {f1_score(y_test, y_pred)}')

    cm = confusion_matrix(y_test, y_pred)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
    plt.title('Confusion Matrix')
    plt.xlabel('Predicted')
    plt.ylabel('True')
    plt.show()

Test each model to select the best one for prediction.

Metrics: ROC AUC score and curve, recall, F1, accuracy, and precision.

We opted not to use StratifiedKFold, since the class imbalance was already addressed with SMOTE.
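The threshold sweep can be illustrated on a toy example (the `y_true`/`proba` values below are made up for illustration; `np.arange` sidesteps the floating-point drift that repeated `+= 0.1` accumulates, visible as 0.7999… in the printed thresholds below):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Illustrative labels and predicted probabilities (made-up values)
y_true = np.array([0, 0, 0, 1, 1, 1, 0, 1])
proba = np.array([0.10, 0.40, 0.35, 0.80, 0.65, 0.90, 0.55, 0.70])

# AUC of the raw probabilities: threshold-free ranking quality
auc = roc_auc_score(y_true, proba)
print(f"AUC on probabilities: {auc:.3f}")  # perfectly separable here -> 1.000

# Scoring hard 0/1 predictions at each threshold, as the sweep does;
# np.arange avoids the 0.7999... drift of repeatedly adding 0.1
for t in np.arange(0.0, 1.01, 0.1):
    hard = (proba >= t).astype(int)
    print(f"threshold {t:.1f} -> AUC {roc_auc_score(y_true, hard):.3f}")
```

Note that scoring binarised predictions with `roc_auc_score` reduces the curve to a single operating point; the threshold-free AUC on the probabilities is the more standard headline number.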

Decision Tree¶

In [48]:
prediction,predicted_proba= train_without_kfolds(decision_tree,x_train, y_train, x_test, y_test)
In [49]:
evaluating_model("Decision Tree",y_test,prediction,predicted_proba)
Threshold 0 -- 0.5
Threshold 0.1 -- 0.9161592787417638
Threshold 0.2 -- 0.9161592787417638
Threshold 0.30000000000000004 -- 0.9161592787417638
Threshold 0.4 -- 0.9161592787417638
Threshold 0.5 -- 0.9161592787417638
Threshold 0.6 -- 0.9161735836655187
Threshold 0.7 -- 0.9161735836655187
Threshold 0.7999999999999999 -- 0.9161735836655187
Threshold 0.8999999999999999 -- 0.9161735836655187
Threshold 0.9999999999999999 -- 0.9161735836655187
---Optimum Threshold --- 0.6 --ROC-- 0.9161735836655187
Test ROC AUC Score: 91.62%
              precision    recall  f1-score   support

           0       0.93      0.90      0.91     69906
           1       0.90      0.93      0.92     69647

    accuracy                           0.92    139553
   macro avg       0.92      0.92      0.92    139553
weighted avg       0.92      0.92      0.92    139553

----Different scores----
Accuracy_score: 0.9161393879028039
Precision_score: 0.9010520487264674
Recall_score: 0.9345987623300358
F1-score: 0.9175188706505882

KNN¶

In [50]:
prediction,predicted_proba= train_without_kfolds(knn,x_train, y_train, x_test, y_test)
In [51]:
evaluating_model("KNN",y_test,prediction,predicted_proba)
Threshold 0 -- 0.5
Threshold 0.1 -- 0.7924067868815349
Threshold 0.2 -- 0.7924067868815349
Threshold 0.30000000000000004 -- 0.8359218595769424
Threshold 0.4 -- 0.8359218595769424
Threshold 0.5 -- 0.8765540645639255
Threshold 0.6 -- 0.8765540645639255
Threshold 0.7 -- 0.9174526370612743
Threshold 0.7999999999999999 -- 0.9174526370612743
Threshold 0.8999999999999999 -- 0.9487201824645349
Threshold 0.9999999999999999 -- 0.9487201824645349
---Optimum Threshold --- 0.8999999999999999 --ROC-- 0.9487201824645349
Test ROC AUC Score: 87.66%
              precision    recall  f1-score   support

           0       1.00      0.75      0.86     69906
           1       0.80      1.00      0.89     69647

    accuracy                           0.88    139553
   macro avg       0.90      0.88      0.87    139553
weighted avg       0.90      0.88      0.87    139553

----Different scores----
Accuracy_score: 0.8763265569353579
Precision_score: 0.8018228746572028
Recall_score: 0.999138512785906
F1-score: 0.889671616602635

Random Forest¶

In [52]:
prediction,predicted_proba= train_without_kfolds(rf,x_train, y_train, x_test, y_test)
In [53]:
evaluating_model("Random Forest",y_test,prediction,predicted_proba)
Threshold 0 -- 0.5
Threshold 0.1 -- 0.7962440681472773
Threshold 0.2 -- 0.8827367656994646
Threshold 0.30000000000000004 -- 0.9254163519360591
Threshold 0.4 -- 0.945370704911418
Threshold 0.5 -- 0.9562530473393123
Threshold 0.6 -- 0.9576023234778211
Threshold 0.7 -- 0.9513614519912462
Threshold 0.7999999999999999 -- 0.9312936067902051
Threshold 0.8999999999999999 -- 0.8752163964519417
Threshold 0.9999999999999999 -- 0.6484629632288541
---Optimum Threshold --- 0.6 --ROC-- 0.9576023234778211
Test ROC AUC Score: 95.63%
              precision    recall  f1-score   support

           0       0.96      0.95      0.96     69906
           1       0.95      0.96      0.96     69647

    accuracy                           0.96    139553
   macro avg       0.96      0.96      0.96    139553
weighted avg       0.96      0.96      0.96    139553

----Different scores----
Accuracy_score: 0.9562388483228594
Precision_score: 0.9491983146226282
Recall_score: 0.9639036857294643
F1-score: 0.9564944825571868